Behind the Bait: Delving into PhishTank's hidden data

Phishing is a form of social engineering that aims to deceive individuals through email communication. Extensive prior research has identified phishing as one of the most commonly used attack vectors for infiltrating organizational networks. A prevalent method is to mislead the target with phishing URLs concealed behind hyperlinks. PhishTank, a crowd-sourced website, aggregates phishing URLs and subsequently verifies them. In this study, we used a Python script to extract data from the PhishTank website, amassing a dataset of over 190,0000 phishing URLs. This dataset is a valuable resource that researchers and practitioners can use to enhance phishing filters, fortify firewalls, support security education, and refine training and testing models, among other applications.


Data Specification
For enhanced clarity, the data specifications are presented in Table 1.
Specific Subject Area: Information Security, Artificial Intelligence
Data Format: Raw CSV file
Type of Data: Table

Data Collection How data was acquired
Data were extracted from publicly available lists of phishing and legitimate URLs on the PhishTank website.

Value of Data
• Over 190,0000 URLs have been extracted, providing an extensive dataset for researchers and practitioners to utilize.
• The dataset encompasses instances of phishing URLs obtained from PhishTank. This data is suitable for use in various machine learning processes, including model training and prediction.
• The dataset is versatile and can serve various purposes, including:
- Training classifiers for identifying phishing URLs.
- Serving as an evaluation benchmark.
- Facilitating the development of browser plugins or email filters.
- Training reinforcement learning agents.
- Examining trends over time.
- Enhancing phishing education efforts.
• Machine learning and data mining researchers, as well as information security professionals, can derive significant value from these datasets. The data serves as a valuable resource for developing firewall solutions, intelligent ad-blocking mechanisms, and malware detection systems. The phishing URL dataset could help enhance firewall solutions against phishing attacks in several ways:
- Generating URL blocklists.
- Enhancing URL pattern recognition.
- Facilitating the development of learning models.
- Assessing the firewall's performance.
• These datasets can be employed for educational purposes in the classroom to:
- Instruct on the differentiation between phishing and legitimate URLs.
- Conduct user-centered analysis on the extracted phishing URLs.
- Create phishing awareness games and simulations.
- Engage in class activities to practice identifying and analyzing phishing URLs.
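One of the uses listed above, generating URL blocklists, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not part of the published script: the column name `Phish` follows the attribute list in the Data Description, while the sample rows, the `example-bank.test` hosts, and the helper name `build_host_blocklist` are hypothetical.

```python
import csv
import io
from urllib.parse import urlsplit

# Hypothetical sample rows; a real file would be opened with open(path, newline="").
SAMPLE_CSV = """ID,Phish
1001,http://login.example-bank.test/verify/account
1002,https://Example-Bank.test/update
1003,http://login.example-bank.test/other
1004,not a url
"""


def build_host_blocklist(csv_text, url_column="Phish"):
    """Return a sorted list of the unique hostnames found in url_column."""
    hosts = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw_url = (row.get(url_column) or "").strip()
        host = urlsplit(raw_url).hostname  # None when no host can be parsed
        if host:
            hosts.add(host.lower())
    return sorted(hosts)


blocklist = build_host_blocklist(SAMPLE_CSV)
print(blocklist)
```

Blocking by hostname rather than by full URL is a design choice made here for illustration; it keeps working when the path of a stored URL is incomplete, at the cost of being coarser than a per-URL list.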

Data Description
• The dataset is contained in CSV files named in the format Valid Phishes Offline (n).xlsx. A small sample is presented in Table 2. Each attribute in these files is explained below:
- ID - The unique identifier assigned to the submitted URL on PhishTank.
- Phish URL - Link to the phishing URL report on the PhishTank website. The URL can be explored further on the PhishTank website by visiting the link given in this column.
- Phish - The actual URL submitted to the website.
- Submitted (Information) - Date and time when the URL was submitted to PhishTank.
- Submitted User Link - The web link to the PhishTank user who submitted the phishing information.
- Online/Offline - Indicates whether the phishing URL was still active or offline at the time of data collection.
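The attribute list above maps naturally onto a small record type. The following is a minimal sketch, assuming the CSV header uses exactly the attribute names given above; the actual header spelling, the sample rows, the `example.test` addresses, and the names `PhishRecord` and `load_records` are assumptions made for illustration.

```python
import csv
import io
from collections import Counter
from dataclasses import dataclass


@dataclass
class PhishRecord:
    phish_id: str        # ID
    report_link: str     # Phish URL (link to the PhishTank report)
    url: str             # Phish (the submitted URL)
    submitted: str       # Submitted (Information), kept as raw text
    submitter_link: str  # Submitted User Link
    status: str          # Online/Offline, normalized to lowercase


# Hypothetical rows; the header names are assumed to match the attribute list.
SAMPLE = (
    "ID,Phish URL,Phish,Submitted (Information),Submitted User Link,Online/Offline\n"
    "1,https://phishtank.example.test/report/1,http://a.example.test/login,"
    "2023-01-05 10:00,https://phishtank.example.test/user/u1,Offline\n"
    "2,https://phishtank.example.test/report/2,http://b.example.test/pay,"
    "2023-01-06 11:30,https://phishtank.example.test/user/u2,Online\n"
    "3,https://phishtank.example.test/report/3,http://c.example.test/id,"
    "2023-01-07 09:15,https://phishtank.example.test/user/u1,Offline\n"
)


def load_records(csv_text):
    """Parse CSV text into a list of PhishRecord objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        PhishRecord(
            phish_id=row["ID"],
            report_link=row["Phish URL"],
            url=row["Phish"],
            submitted=row["Submitted (Information)"],
            submitter_link=row["Submitted User Link"],
            status=row["Online/Offline"].strip().lower(),
        )
        for row in reader
    ]


records = load_records(SAMPLE)
status_counts = Counter(record.status for record in records)
print(status_counts)
```

The submission timestamp is kept as text because its exact format in the files is not specified here; it would need to be parsed before any analysis of trends over time.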

Experimental Design, Materials and Methods
Social engineers [1] manipulate individuals working for the organization by disseminating infected links, files, or malware (phishing attacks [2,3]). Through these deceptive tactics, the unwitting human assets inadvertently grant social engineers unauthorized access to the system.
In compiling the phishing URL datasets, we employed a Python script to retrieve phishing URL data from the PhishTank website. The extraction procedure is described in the paper as pseudo-code and an algorithm. Through this procedure, we obtained a phishing URL dataset comprising over 190,0000 entries. To enhance readability and evaluation, the acquired list was subsequently annotated. This effort resulted in three CSV files, each containing extracted features [4]. These CSV files are compatible with various tools and programming libraries, facilitating ease of use and analysis.
The following steps elucidate the process for regenerating data from the provided script or code file located within the designated folder.
1. Begin by installing the most recent version of Python.
2. Access the PhishTank website and use the filters provided on the site to conduct searches.
3. Copy the updated website link and paste it into the Python code provided, as illustrated in the following example.
• Generate a new CSV file.
• Name the newly created file "Data.csv".
4. Execute the program or script in the Command Prompt (Cmd).

Code Snippet:
The following code snippet is used to extract data from the PhishTank website.
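As a minimal, illustrative sketch of this kind of extraction, and not the authors' script, the block below assumes that a results page presents its entries as an HTML table. The sample markup, the column headers, and the names `TableRowParser` and `rows_to_csv` are hypothetical, and the page is supplied as a string so that the example runs without network access.

```python
import csv
import io
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def rows_to_csv(rows):
    """Serialize a list of rows to CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical markup standing in for a downloaded results page.
SAMPLE_HTML = """
<table>
<tr><th>ID</th><th>Phish URL</th><th>Submitted</th></tr>
<tr><td><a href="#">101</a></td><td>http://a.example.test/login</td>
    <td>by <a href="#">user1</a></td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(SAMPLE_HTML)
csv_text = rows_to_csv(parser.rows)
print(csv_text)
```

For a live page, the HTML string would come from an HTTP request to the filtered link, and `csv_text` would be written to a file such as "Data.csv".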

Limitation
One limitation of this study pertains to the character length of reported phishing URLs within PhishTank. While the study imposed a limit of 70 characters, certain URLs listed on PhishTank exceeded this threshold. Consequently, only the initial 70 characters of such lengthy URLs were captured in the dataset. Even with this limitation, the retrieved URLs provide a substantial amount of pertinent information. For instance, the URL given as an example in footnote 4 exceeds 70 characters in length; under this constraint, only its initial 70 characters were extracted.
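The effect of the 70-character limit can be illustrated with a short sketch. It assumes that a truncated entry is stored as exactly the first 70 characters of the original URL; the helper names and the `example.test` URLs are hypothetical.

```python
MAX_URL_CHARS = 70  # limit stated in the Limitation section


def truncate_url(url, limit=MAX_URL_CHARS):
    """Keep only the first `limit` characters of a URL."""
    return url[:limit]


def may_be_truncated(stored_url, limit=MAX_URL_CHARS):
    """A stored URL that reaches the limit may have lost its tail."""
    return len(stored_url) >= limit


# Hypothetical URLs: one longer than the limit, one shorter.
long_url = "http://login.example.test/" + "a" * 60
short_url = "http://example.test/"

stored = truncate_url(long_url)
print(len(long_url), len(stored), may_be_truncated(stored))
print(len(short_url), may_be_truncated(short_url))
```

A check of this kind lets users of the dataset separate entries whose full path is known from those that should be treated as prefixes only.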

Ethics Statement
This study confirms that the current work does not involve human subjects, animal experiments, or any data collected from social media platforms, and thus did not require any approval.

Table 1
Specification Table.
- Implement established phishing URL detection methods such as blacklists, URL rule-based classifiers, and existing ML models as baselines. Train and optimize them on the training set.
- Develop new phishing URL detection techniques; these could involve newly engineered features or different ML algorithms such as DNNs and ensemble models. Train the novel models on the training set.
- Evaluate the performance of both the baseline and new models on the common hold-out test set. Metrics such as accuracy, precision, recall, F1-score, and the ROC curve can be used.
- The benchmarks allow a direct comparison of the novel techniques against existing standard approaches. The metrics quantify the gains achieved by the new methods.
- Analyze the errors made by both old and new techniques to understand why they succeed or fail in certain cases. This provides insights into limitations and areas for improvement.
- The benchmarks help assess which novel phishing URL detection techniques are promising and which need further refinement before real-world deployment.
- The standardized benchmark and test set facilitate rapid iteration between developing new models and evaluating their performance against multiple baselines using a common dataset.
• Moreover, the dataset holds potential for various applications, including benchmarking novel phishing URL detection techniques against established methods, conducting research to identify innovative features for distinguishing phishing URLs through in-depth analysis of the examples, developing browser plugins or email filters for phishing URL detection, creating visualization tools to emphasize distinctions between phishing and legitimate URLs, and establishing public blacklists or blocklists containing verified phishing URLs based on the gathered samples.
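The evaluation step of the benchmarking workflow above can be sketched as follows. This is a minimal illustration, assuming binary labels with 1 for phishing and 0 for legitimate; the label vectors and the two sets of predictions are made-up values, not results from the dataset.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = phishing)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up hold-out labels and predictions for a baseline and a novel model.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
baseline_pred = [1, 0, 1, 0, 1, 1, 0, 0]
novel_pred = [1, 1, 1, 0, 0, 1, 1, 1]

baseline = classification_metrics(y_true, baseline_pred)
novel = classification_metrics(y_true, novel_pred)

for name in ("accuracy", "precision", "recall", "f1"):
    print(f"{name}: baseline={baseline[name]:.3f} novel={novel[name]:.3f}")
```

Scoring both models on the same hold-out labels is what makes the comparison direct; the ROC curve mentioned above would additionally require prediction scores rather than hard labels.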

Table 2
Dataset attributes.
Label indicating this URL has been verified as phishing by PhishTank.